{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcc3 Solution for Exercise M6.04\n",
    "\n",
    "The aim of the exercise is to get familiar with the histogram\n",
    "gradient-boosting in scikit-learn. Besides, we will use this model within a\n",
    "cross-validation framework in order to inspect internal parameters found via\n",
    "grid-search.\n",
    "\n",
    "We will use the California housing dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "\n",
    "data, target = fetch_california_housing(return_X_y=True, as_frame=True)\n",
    "target *= 100  # rescale the target in k$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, create a histogram gradient boosting regressor. You can set the trees\n",
    "number to be large, and configure the model to use early-stopping."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.ensemble import HistGradientBoostingRegressor\n",
    "\n",
    "hist_gbdt = HistGradientBoostingRegressor(\n",
    "    max_iter=1000, early_stopping=True, random_state=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use a grid-search to find some optimal parameter for this model. In\n",
    "this grid-search, you should search for the following parameters:\n",
    "\n",
    "* `max_depth: [3, 8]`;\n",
    "* `max_leaf_nodes: [15, 31]`;\n",
    "* `learning_rate: [0.1, 1]`.\n",
    "\n",
    "Feel free to explore the space with additional values. Create the grid-search\n",
    "providing the previous gradient boosting instance as the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "params = {\n",
    "    \"max_depth\": [3, 8],\n",
    "    \"max_leaf_nodes\": [15, 31],\n",
    "    \"learning_rate\": [0.1, 1],\n",
    "}\n",
    "\n",
    "search = GridSearchCV(hist_gbdt, params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we will run our experiment through cross-validation. In this regard,\n",
    "define a 5-fold cross-validation. Besides, be sure to shuffle the data.\n",
    "Subsequently, use the function `sklearn.model_selection.cross_validate` to run\n",
    "the cross-validation. You should also set `return_estimator=True`, so that we\n",
    "can investigate the inner model trained via cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.model_selection import cross_validate\n",
    "from sklearn.model_selection import KFold\n",
    "\n",
    "cv = KFold(n_splits=5, shuffle=True, random_state=0)\n",
    "results = cross_validate(\n",
    "    search,\n",
    "    data,\n",
    "    target,\n",
    "    cv=cv,\n",
    "    return_estimator=True,\n",
    "    # n_jobs=2  # Uncomment this if you run locally\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we got the cross-validation results, print out the mean and standard\n",
    "deviation score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "print(\n",
    "    \"R2 score with cross-validation:\\n\"\n",
    "    f\"{results['test_score'].mean():.3f} \u00b1 \"\n",
    "    f\"{results['test_score'].std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then inspect the `estimator` entry of the results and check the best\n",
    "parameters values. Besides, check the number of trees used by the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "for estimator in results[\"estimator\"]:\n",
    "    print(estimator.best_params_)\n",
    "    print(f\"# trees: {estimator.best_estimator_.n_iter_}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "We observe that the parameters are varying. We can get the intuition that\n",
    "results of the inner CV are very close for certain set of parameters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Inspect the results of the inner CV for each estimator of the outer CV.\n",
    "Aggregate the mean test score for each parameter combination and make a box\n",
    "plot of these scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "import pandas as pd\n",
    "\n",
    "index_columns = [f\"param_{name}\" for name in params.keys()]\n",
    "columns = index_columns + [\"mean_test_score\"]\n",
    "\n",
    "inner_cv_results = []\n",
    "for cv_idx, estimator in enumerate(results[\"estimator\"]):\n",
    "    search_cv_results = pd.DataFrame(estimator.cv_results_)\n",
    "    search_cv_results = search_cv_results[columns].set_index(index_columns)\n",
    "    search_cv_results = search_cv_results.rename(\n",
    "        columns={\"mean_test_score\": f\"CV {cv_idx}\"}\n",
    "    )\n",
    "    inner_cv_results.append(search_cv_results)\n",
    "inner_cv_results = pd.concat(inner_cv_results, axis=1).T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n",
    "inner_cv_results.plot.box(vert=False, color=color)\n",
    "plt.xlabel(\"R2 score\")\n",
    "plt.ylabel(\"Parameters\")\n",
    "_ = plt.title(\n",
    "    \"Inner CV results with parameters\\n\"\n",
    "    \"(max_depth, max_leaf_nodes, learning_rate)\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "We see that the first 4 ranked set of parameters are very close. We could\n",
    "select any of these 4 combinations. It coincides with the results we observe\n",
    "when inspecting the best parameters of the outer CV."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}